Chapter 1: Linear models

Big picture

This course builds on an understanding of the mechanics of linear models. Here we introduce some key topics that will facilitate future understanding hierarchical models.

Learning goals

  • linear regression with lm
  • intercepts, “categorical” effects
  • varying model structure to estimate effects and standard errors
  • interactions as variation in slope estimates for different groups
  • centering input variables and intepreting resulting parameters
  • assumptions and unarticulated priors
  • understanding residual variance (Gaussian)
  • understanding all of the above graphically
  • understanding and plotting output of lm
  • notation and linear algebra review: \(X\beta\)

Linear regression, ANOVA, ANCOVA, and multiple regression models are all species cases of general linear models (hereafter “linear models”). In all of these cases, we have observed some response variable \(y\), which is potentially modeled as a function of some covariate(s) \(x_1, x_2, ..., x_p\).

Model of the mean

If we have no covariates of interest, then we may be interested in estimating the population mean and variance of the random variable \(Y\) based on \(n\) observations, corresponding to the values \(y_1, ..., y_n\). Here, capital letters indicate the random variable, and lowercase corresponds to realizations of that variable. This model is sometimes referred to as the “model of the mean”.

# simulating a sample of y values from a normal distribution
y <- rnorm(20)
plot(y)
A set of observed y values, n=20.

A set of observed \(y\) values, \(n=20\).

We have two parameters to estimate: the mean of \(Y\), which we’ll refer to as \(\mu\), and the variance of \(Y\), which we’ll refer to as \(\sigma^2\). Here, and in general, we will use greek letters to refer to parameters. If \(Y\) is normally distributed, then we can assume that the realizations or samples \(y\) that we observe are also normally distributed: \(y \sim N(\mu, \sigma^2)\). Here and elsewhere, the \(\sim\) symbol represents that some quantity “is distributed as” something else (usually a probability distribution). You can also think of \(\sim\) as meaning “is sampled from”. A key concept here is that we are performing statistical inference, meaning we are trying to learn about (estimate) population-level parameters with sample data. In other words, we are not trying to learn about the sample mean \(\bar{y}\) or sample variance of \(y\). These can be calculated and treated as known once we have observed a particular collection of \(y\) values. The unknown quantities \(\mu\) and \(\sigma^2\) are the targets of inference.

Fitting this model (and linear models in general) is possible in R with the lm function. For this rather simple model, we can estimate the parameters as follows:

# fitting a model of the mean with lm
m <- lm(y ~ 1)
summary(m)
## 
## Call:
## lm(formula = y ~ 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6291 -0.5263 -0.1141  0.8091  1.6353 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.3617     0.2108   1.716    0.102
## 
## Residual standard error: 0.9426 on 19 degrees of freedom

The summary of our model object m provides a lot of information. For reasons that will become clear shortly, the estimated population mean is referred to as the “Intercept”. Here, we get a point estimate for the population mean \(\mu\): 0.362 and an estimate of the residual standard deviation \(\sigma\): 0.943, which we can square to get an estimate of the residual variance \(\sigma^2\): 0.888.

Linear regression

Often, we are interested in estimating the mean of \(Y\) as a function of some other variable, say \(X\). Simple linear regression assumes that \(y\) is again sampled from a normal distribution, but this time the mean or expected value of \(y\) is a function of \(x\):

\[y_i \sim N(\mu_i, \sigma^2)\]

\[\mu_i = \alpha + \beta x_i\]

Here, subscripts indicate which particular value of \(y\) and \(x\) we’re talking about. Specifically, we observe \(n\) pairs of values: \((y_i, x_i), ..., (y_n, x_n)\), with all \(x\) values known exactly. Linear regression models can equivalently be written as follows:

\[y_i = \alpha + \beta x_i + \epsilon_i\]

\[\epsilon_i \sim N(0, \sigma^2)\]

Key assumptions here are that each of the error terms \(\epsilon_1, ..., \epsilon_n\) are normally distributed around zero with some variance (i.e., the error terms are identically distributed), and that the value of \(\epsilon_1\) does not affect the value of any other \(\epsilon\) (i.e., the errors are independent). This combination of assumptions is often referred to as “independent and identically distributed” or i.i.d. Equivalently, given some particular \(x_i\) and a set of linear regression parameters, the distribution of \(y_i\) is normal. A common misconception is that linear regression assumes the distribution of \(y\) is normal. This is wrong - linear regression assumes that the error terms are normally distributed. The assumption that the variance \(\sigma^2\) is constant for all values of \(x\) is referred to as homoskedasticity. Rural readers may find it useful to think of skedasticity as the amount of “skedaddle” away from the regression line in the \(y\) values. If the variance is changing across values of \(x\), then the assumption of homoskedasticity is violated and you’ve got a heteroskedasticity problem.

# simulate and plot x and y values
n <- 50
x <- runif(n)
alpha <- -2
beta <- 3
sigma <- .4
y <- rnorm(n, mean = alpha + beta * x, sd = sigma)
plot(x, y)

# add known mean function 
lines(x = x, y = alpha + beta * x, col='blue')
legend('topleft', 
       pch = c(1, NA), lty = c(NA, 1), 
       col = c('black', 'blue'), 
       legend = c('Observed data', 'E(y | x)'), 
       bty = 'n')
Simulated data from a linear regression model. The true expected value of y given x, E(y | x), is shown as a line.

Simulated data from a linear regression model. The true expected value of y given x, E(y | x), is shown as a line.

The normality assumption means that the probability density of \(y\) is highest at the value \(\alpha + \beta x\), where the regression line is, and falls off away from the line according to the normal probablity density. This graphically looks like a bell ‘tube’ along the regression line, adding a dimension along \(x\) to the classic bell ‘curve’.

Graphical depiction of the linear regression normality assumption. The probability density of y is shown in color. Higher probabilities are shown as more intense colors, and regions with low probabilities are lighter.

Graphical depiction of the linear regression normality assumption. The probability density of y is shown in color. Higher probabilities are shown as more intense colors, and regions with low probabilities are lighter.

Model fitting

Linear regression parameters \(\alpha\), \(\beta\), and \(\sigma^2\) can be estimated with lm. The syntax is very similar to the previous model, except now we need to include our covariate x in the formula (the first argument to the lm function).

m <- lm(y ~ x)
summary(m)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83406 -0.17346  0.03494  0.23322  0.63511 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.03298    0.08625  -23.57   <2e-16 ***
## x            3.16746    0.15594   20.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3403 on 48 degrees of freedom
## Multiple R-squared:  0.8958, Adjusted R-squared:  0.8936 
## F-statistic: 412.6 on 1 and 48 DF,  p-value: < 2.2e-16

The point estimate for the parameter \(\alpha\) is called “(Intercept)”. This is because our estimate for \(\alpha\) is the y-intercept of the estimated regression line when \(x=0\) (recall that \(y_i = \alpha + \beta x_i + \epsilon_i\)). The estimate for \(\beta\) is called “x”, because it is a coefficient associated with the variable “x” in this model. This parameter is often referred to as the “slope”, because it represents the increase in the expected value of \(y\) for a one unit increase in \(x\) (the rise in y over run in x). Point estimates for the standard deviation and variance of \(\epsilon\) can be extracted as before (summary(m)$sigma and summary(m)$sigma^2).

Centering and scaling covariates

Often, it’s a good idea to “center” covariates so that they have a mean of zero (\(\bar{x} = 0\)). This is achieved by subtracting the sample mean of a covariate from the vector of covariate values (\(x - \bar{x}\)). It’s also useful to additionally scale covariates so that they are all on a common and unitless scale. While many will divide each covariate by its standard deviation, Gelman and Hill (pg. 57) recommend dividing by twice the standard deviation (\(s_x\)) so that binary covariates are transformed from \(x \in \{0, 1\}\) to \(x_t \in \{-0.5, 0.5\}\), where \(x_t\) is the transformed covariate: \(x_t = \frac{x - \bar{x}}{2 s_x}\). If covariates are not centered and scaled, then it is common to observe correlations between estimated slopes and intercepts.

Linear regression line of best fit with 95% confidence intervals for the line. Notice that if the slope is lower (the upper dashed line) then the intercept necessarily goes up, and if the slope is higher (the lower dashed line), the intercept must decrease.

Linear regression line of best fit with 95% confidence intervals for the line. Notice that if the slope is lower (the upper dashed line) then the intercept necessarily goes up, and if the slope is higher (the lower dashed line), the intercept must decrease.

So, we expect that in this case, the estimates for the intercept and slope must be negatively correlated. This is bourne out in the confidence region for our estimates of \(\alpha\) and \(\beta\). Usually, people inspect univariate confidence intervals for parameters, e.g.,

confint(m)
##                 2.5 %    97.5 %
## (Intercept) -3.872117 -2.693003
## x            0.903557  1.241512

This is misleading because our estimates for these parameters are correlated. For any given value of the intercept, there are only certain values of the slope that are supported. To assess this possibility, we might also be interested in the bivariate confidence ellipse for these two parameters. We can evaluate this quantity graphically as follows with some help from the car package:

library(car)
confidenceEllipse(m)

This is not great. We want to be able to directly use the univariate confidence intervals. Our problem can be solved by centering \(x\):

Now there is no serious correlation in the estimates and we are free to use the univariate confidence intervals without needing to consider the joint distribution of the slope and intercept. This trick helps with interpretation, but it will also prove useful later in the course in the context of Markov chain Monte Carlo (MCMC) sampling.

Checking assumptions

We have assumed that the distribution of error terms is normally distributed, and this assumption is worth checking. Below, we plot a histogram of the residuals (another name for the \(\epsilon\) parameters) along with a superimposed normal probability density so that we can check normality.

hist(resid(m), breaks = 20, freq = F, 
     main = 'Histogram of model residuals')
curve_x <- seq(min(resid(m)), max(resid(m)), .01)
lines(curve_x, dnorm(curve_x, 0, summary(m)$sigma))
Simulated data from a linear regression model.

Simulated data from a linear regression model.

Even when the assumption of normality is correct, it is not always obvious that the residuals are normally distributed. Another useful plot for assessing normality of errors is a quantile-quantile or Q-Q plot. If the residuals do not deviate much from normality, then the points in a Q-Q plot won’t deviate much from the dashed one-to-one line. If points lie above or below the line, then the residual is larger or smaller, respectively, than expected based on a normal distribution.

plot(m, 2)
A quantile-quantile plot to assess normality of residuals.

A quantile-quantile plot to assess normality of residuals.

To assess heteroskedasticity, it is useful to inspect a plot of the residuals vs. fitted values, e.g. plot(m, 1). If it seems as though the spread or variance of residuals varies across the range of fitted values, then it may be worth worrying about homoskedasticity and trying some transformations to fix the problem.

Analysis of variance

Sometimes, the covariate of interest is not continuous but instead categorical (e.g., “chocolate”, “strawberry”, or “vanilla”). We might again wonder whether the mean of a random variable \(Y\) depends on the value of this covariate. However, we cannot really estimate a meaningful “slope” parameter, because in this case \(x\) is not continuous. Instead, we might formulate the model as follows:

\[y_i \sim N(\alpha_{j[i]}, \sigma^2)\]

Where \(\alpha_j\) is the mean of group \(j\), and we have \(J\) groups total. The notation \(\alpha_{j[i]}\) represents the notion that the \(i^{th}\) observation corresponds to group \(j\), and we are going to assume that all observations in the \(j^{th}\) group have the same mean, \(\alpha_j\). The above model is perfectly legitimate, and our parameters to estimate are the group means \(\alpha_1, ..., \alpha_J\) and the residual variance \(\sigma^2\). This parameterization is called the “means” parameterization, and though it is perhaps easier to understand than the following alternative, it is less often used.

This model is usually parameterized not in terms of the group means, but rather in terms of an intercept (corresponding to the mean of one “reference” group), and deviations from the intercept (differences between a group of interest and the intercept). For instance, in R, the group whose mean is the intercept (the “reference” group) will be the group whose name comes first alphabetically. Either way, we will estimate the same number of parameters. So if our groups are “chocolate”, “strawberry”, and “vanilla”, R will assign the group “chocolate” to be the intercept, and provide 2 more coefficient estimates for the difference between the estimated group mean of strawberry vs. chocoloate, and vanilla vs. chocolate.

This parameterization can be written as

\[y_i \stackrel{iid}{\sim} N(\mu_0 + \beta_{j[i]}, \sigma^2)\]

where \(\mu_0\) is the “intercept” or mean of the reference group, and \(\beta_j\) represents the difference in the population mean of group \(j\) compared to the reference group (if \(j\) is the reference group, the \(\beta_j = 0\)). Traditionally this model is called simple one-way analysis of variance, but we view it simply as another special case of a linear model.

The following example illustrates some data simulation, visualization, and parameter estimation in this context. Specifically, we assess 60 humans for their taste response to three flavors of iced cream. We want to extrapolate from our sample to the broader population of all ice cream eating humans to learn whether in general people think ice cream tastiness varies as a function of flavor.

# simulate and visualize data
n <- 60
x <- rep(c("chocolate", "strawberry", "vanilla"), length.out = n)
x <- factor(x)
sigma <- 1
mu_y <- c(chocolate = 3.352, strawberry = .93, vanilla = 1.5)
y <- rnorm(n, mu_y[x], sigma)

library(ggplot2)
ggplot(data.frame(x, y), aes(x, y)) + 
  geom_jitter(position = position_jitter(width=.1)) + 
  xlab('Group') + 
  ylab('Tastiness')

Model fitting

We can estimate our parameters with the lm function (this should be a strong hint that there are not huge differences between linear regression and ANOVA). The syntax is exactly the same as with linear regression. The only difference is that our input x is not numeric, it’s a character vector.

m <- lm(y ~ x)
summary(m)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.01885 -0.53330 -0.08907  0.60145  2.57251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.9887     0.2162  13.823  < 2e-16 ***
## xstrawberry  -1.9144     0.3058  -6.261 5.36e-08 ***
## xvanilla     -1.4427     0.3058  -4.718 1.59e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.967 on 57 degrees of freedom
## Multiple R-squared:  0.4275, Adjusted R-squared:  0.4074 
## F-statistic: 21.28 on 2 and 57 DF,  p-value: 1.251e-07

Because chocolate comes first alphabetically, it is the reference group and the “(Intercept)” estimate corresponds to the estimate of the group-level mean for chocolate. The other two estimates are contrasts between the other groups and this reference group, i.e. “xstrawberry” is the estimated difference between the group mean for strawberry and the reference group.

If we wish instead to use a means paramaterization, we need to supress the intercept term in our model as follows:

m <- lm(y ~ 0 + x)
summary(m)
## 
## Call:
## lm(formula = y ~ 0 + x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.01885 -0.53330 -0.08907  0.60145  2.57251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## xchocolate    2.9887     0.2162  13.823  < 2e-16 ***
## xstrawberry   1.0743     0.2162   4.969 6.49e-06 ***
## xvanilla      1.5461     0.2162   7.151 1.78e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.967 on 57 degrees of freedom
## Multiple R-squared:  0.824,  Adjusted R-squared:  0.8148 
## F-statistic: 88.96 on 3 and 57 DF,  p-value: < 2.2e-16

Arguably, this approach is more useful because it simplifies the construction of confidence intervals for the group means:

confint(m)
##                2.5 %   97.5 %
## xchocolate  2.555779 3.421713
## xstrawberry 0.641367 1.507301
## xvanilla    1.113126 1.979061

Checking assumptions

Our assumptions in this simple one way ANOVA context are identical to our assumptions with linear regression. Specifically, we assumed that our errors are independently and identically distributed, and that the variance is constant for each group (homoskedasticity). The built in plot method for lm objects is designed to return diagnostic plots that help to check these assumptions.

par(mfrow=c(2, 2))
plot(m)

General linear models

We have covered a few special cases of general linear models, which are usually written as follows:

\[y \stackrel{iid}{\sim} N(X \beta, \sigma^2)\]

Where \(y\) is a vector consisting of \(n\) observations, \(X\) is a “design” matrix with \(n\) rows and \(p\) columns, and \(\beta\) is a vector of \(p\) parameters. There are multivariate general linear models (e.g., MANOVA) where the response variable is a matrix and a covariance matrix is used in place of a scalar variance parameter, but we’ll stick to univariate models for simplicity. The key point here is that the producct of \(X\) and \(\beta\) provides the mean of the normal distribution from which \(y\) is drawn. From this perspective, the difference between the model of the mean, linear regression, ANOVA, etc., lies in the structure of \(X\) and subsequent interpretation of the parameters \(\beta\). This is a very powerful idea that unites many superficially disparate approaches. It also is the reason that these models are considered “linear”, even though a regression line might by quite non-linear (e.g., polynomial regression). These models are linear in their parameters, meaning that our expected value for the response \(y\) is a linear combination (formal notion) of the parameters. If a vector of expected values for \(y\) in some model cannot be represented as \(X \beta\), then it is not a linear model.

In the model of the mean, \(X\) is an \(n\) by \(1\) matrix, with each element equal to \(1\) (i.e. a vector of ones). With linear regression, \(X\)’s first column is all ones (corresponding to the intercept parameter), and the second column contains the values of the covariate \(x\). In ANOVA, the design matrix \(X\) will differ between the means and effects parameterizations. With a means parameterization, the entries in column \(j\) will equal one if observation (row) \(i\) is in group \(j\), and entries are zero otherwise. If you are not comfortable with matrix multiplication, it’s worth investing some effort so that you can understand why \(X\beta\) is such a powerful construct.

Can you figure out the structure of \(X\) with R’s default effects parameterization? You can check your work with model.matrix(m), where m is a model that you’ve fitted with lm.

Interactions between covariates

Often, the effect of one covariate depends on the value of another covariate. This is referred to as “interaction” between the covariates. Interactions can exist between two or more continuous and/or nominal covariates. These situations have special names in the classical statistics literature. For example, models with interactions between nominal covariates fall under “factorial ANOVA”, those with interactions between a continuous and a nominal covariate are referred to as “analysis of covariance (ANCOVA)”. Here we prefer to consider these all as special cases of general linear models.

Interactions between two continuous covariates

Here we demonstrate simulation and estimation for a model with an interaction between two continuous covariates. Notice that in the simulation, we have exploited the \(X \beta\) construct to generate a vector of expected values for \(y\).

n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
beta <- c(.5, 1, -1, 2)
sigma <- 1
X <- matrix(c(rep(1, n), x1, x2, x1 * x2), nrow=n)
mu_y <- X %*% beta
y <- rnorm(n, mu_y, sigma)
m <- lm(y ~ x1 * x2)
summary(m)
## 
## Call:
## lm(formula = y ~ x1 * x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9114 -0.5290  0.2147  0.7333  1.7652 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.4046     0.1714   2.360   0.0226 *  
## x1            1.0458     0.1933   5.411 2.19e-06 ***
## x2           -0.7238     0.1555  -4.655 2.77e-05 ***
## x1:x2         2.2430     0.1732  12.950  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.145 on 46 degrees of freedom
## Multiple R-squared:  0.8295, Adjusted R-squared:  0.8184 
## F-statistic:  74.6 on 3 and 46 DF,  p-value: < 2.2e-16

Visualizing these models is tricky, because we are in 3d space (with dimensions \(x_1\), \(x_2\), and \(y\)), but contour plots can be effective and leverage peoples’ understanding of topographic maps.

# visualizing the results in terms of the linear predictor
lo <- 40
x1seq <- seq(min(x1), max(x1), length.out = lo)
x2seq <- seq(min(x2), max(x2), length.out = lo)
g <- expand.grid(x1=x1seq, x2=x2seq)
g$e_y <- beta[1] + beta[2] * g$x1 + beta[3] * g$x2 + beta[4] * g$x1 * g$x2
ggplot(g, aes(x=x1, y=x2)) + 
  geom_tile(aes(fill=e_y)) + 
  stat_contour(aes(z=e_y), col='grey') + 
  scale_fill_gradient2() + 
  geom_point(data=data.frame(x1, x2))

Alternatively, you might check out the effects package:

library(effects)
plot(allEffects(m))

Interactions between two categorical covariates

Here we demonstrate interaction between two categorical covariates, using the diamonds dataset which is in the ggplot2 package.t We are interested in the relationship between diamond price, cut quality, and color.

str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
ToothGrowth$dose <- factor(ToothGrowth$dose)
ggplot(ToothGrowth, aes(x=interaction(dose, supp), y=len)) + 
  geom_point()

In general, visualizing the raw data is a good idea. However, we might also be interested in a table with group-wise summaries, such as the sample means, standard deviations, and sample sizes.

library(dplyr)
ToothGrowth %>%
  group_by(dose, supp) %>%
  summarize(mean = mean(len), 
            sd = sd(len), 
            n = n())
## Source: local data frame [6 x 5]
## Groups: dose [?]
## 
##     dose   supp  mean       sd     n
##   (fctr) (fctr) (dbl)    (dbl) (int)
## 1    0.5     OJ 13.23 4.459709    10
## 2    0.5     VC  7.98 2.746634    10
## 3      1     OJ 22.70 3.910953    10
## 4      1     VC 16.77 2.515309    10
## 5      2     OJ 26.06 2.655058    10
## 6      2     VC 26.14 4.797731    10

We can construct a model to estimate the effect of dose, supplement, and their interaction.

m <- lm(len ~ dose * supp, data = ToothGrowth)
summary(m)
## 
## Call:
## lm(formula = len ~ dose * supp, data = ToothGrowth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8.20  -2.72  -0.27   2.65   8.27 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    13.230      1.148  11.521 3.60e-16 ***
## dose1           9.470      1.624   5.831 3.18e-07 ***
## dose2          12.830      1.624   7.900 1.43e-10 ***
## suppVC         -5.250      1.624  -3.233  0.00209 ** 
## dose1:suppVC   -0.680      2.297  -0.296  0.76831    
## dose2:suppVC    5.330      2.297   2.321  0.02411 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.631 on 54 degrees of freedom
## Multiple R-squared:  0.7937, Adjusted R-squared:  0.7746 
## F-statistic: 41.56 on 5 and 54 DF,  p-value: < 2.2e-16

This summary gives the effects-parameterization version of the summary. The “(Intercept)” refers to the combination of factor levels that occur first alphabetically: in this case, a dose of 0.5 with the “OJ” supplement. The coefficients for dose1 and dose2 represent estimated contrasts for these two groups relative to the intercept. The coefficient for suppVC represents the contrast between the “VC” and “OJ” levels of supplement when the dose is 0.5. The interaction terms represent the difference in the effect of VC for dose1 and dose2 relative to a dose of 0.5. None of this is particularly intuitive, but this information can be gleaned by inspecting the design matrix \(X\) produced by lm in the process of fitting the model. Inspecting the design matrix along with the dataset to gives a better sense for how \(X\) relates to the factor levels:

cbind(model.matrix(m), ToothGrowth)
##    (Intercept) dose1 dose2 suppVC dose1:suppVC dose2:suppVC  len supp dose
## 1            1     0     0      1            0            0  4.2   VC  0.5
## 2            1     0     0      1            0            0 11.5   VC  0.5
## 3            1     0     0      1            0            0  7.3   VC  0.5
## 4            1     0     0      1            0            0  5.8   VC  0.5
## 5            1     0     0      1            0            0  6.4   VC  0.5
## 6            1     0     0      1            0            0 10.0   VC  0.5
## 7            1     0     0      1            0            0 11.2   VC  0.5
## 8            1     0     0      1            0            0 11.2   VC  0.5
## 9            1     0     0      1            0            0  5.2   VC  0.5
## 10           1     0     0      1            0            0  7.0   VC  0.5
## 11           1     1     0      1            1            0 16.5   VC    1
## 12           1     1     0      1            1            0 16.5   VC    1
## 13           1     1     0      1            1            0 15.2   VC    1
## 14           1     1     0      1            1            0 17.3   VC    1
## 15           1     1     0      1            1            0 22.5   VC    1
## 16           1     1     0      1            1            0 17.3   VC    1
## 17           1     1     0      1            1            0 13.6   VC    1
## 18           1     1     0      1            1            0 14.5   VC    1
## 19           1     1     0      1            1            0 18.8   VC    1
## 20           1     1     0      1            1            0 15.5   VC    1
## 21           1     0     1      1            0            1 23.6   VC    2
## 22           1     0     1      1            0            1 18.5   VC    2
## 23           1     0     1      1            0            1 33.9   VC    2
## 24           1     0     1      1            0            1 25.5   VC    2
## 25           1     0     1      1            0            1 26.4   VC    2
## 26           1     0     1      1            0            1 32.5   VC    2
## 27           1     0     1      1            0            1 26.7   VC    2
## 28           1     0     1      1            0            1 21.5   VC    2
## 29           1     0     1      1            0            1 23.3   VC    2
## 30           1     0     1      1            0            1 29.5   VC    2
## 31           1     0     0      0            0            0 15.2   OJ  0.5
## 32           1     0     0      0            0            0 21.5   OJ  0.5
## 33           1     0     0      0            0            0 17.6   OJ  0.5
## 34           1     0     0      0            0            0  9.7   OJ  0.5
## 35           1     0     0      0            0            0 14.5   OJ  0.5
## 36           1     0     0      0            0            0 10.0   OJ  0.5
## 37           1     0     0      0            0            0  8.2   OJ  0.5
## 38           1     0     0      0            0            0  9.4   OJ  0.5
## 39           1     0     0      0            0            0 16.5   OJ  0.5
## 40           1     0     0      0            0            0  9.7   OJ  0.5
## 41           1     1     0      0            0            0 19.7   OJ    1
## 42           1     1     0      0            0            0 23.3   OJ    1
## 43           1     1     0      0            0            0 23.6   OJ    1
## 44           1     1     0      0            0            0 26.4   OJ    1
## 45           1     1     0      0            0            0 20.0   OJ    1
## 46           1     1     0      0            0            0 25.2   OJ    1
## 47           1     1     0      0            0            0 25.8   OJ    1
## 48           1     1     0      0            0            0 21.2   OJ    1
## 49           1     1     0      0            0            0 14.5   OJ    1
## 50           1     1     0      0            0            0 27.3   OJ    1
## 51           1     0     1      0            0            0 25.5   OJ    2
## 52           1     0     1      0            0            0 26.4   OJ    2
## 53           1     0     1      0            0            0 22.4   OJ    2
## 54           1     0     1      0            0            0 24.5   OJ    2
## 55           1     0     1      0            0            0 24.8   OJ    2
## 56           1     0     1      0            0            0 30.9   OJ    2
## 57           1     0     1      0            0            0 26.4   OJ    2
## 58           1     0     1      0            0            0 27.3   OJ    2
## 59           1     0     1      0            0            0 29.4   OJ    2
## 60           1     0     1      0            0            0 23.0   OJ    2

Often, researchers want to know if interactions need to be included in the model. From a null hypothesis significance testing perspective, we can evaluate the ‘significance’ of the interaction term as follows:

anova(m)
## Analysis of Variance Table
## 
## Response: len
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## dose       2 2426.43 1213.22  92.000 < 2.2e-16 ***
## supp       1  205.35  205.35  15.572 0.0002312 ***
## dose:supp  2  108.32   54.16   4.107 0.0218603 *  
## Residuals 54  712.11   13.19                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We find that the interaction between dose and supplement is statistically significant, meaning that if we assume that there is no interaction, it is unlikely to observe data that are as or more extreme as what we have observed over the course of infinitely many replicated experiments that will probably never occur. Although this is far from intuitive, this approach has been widely used. We will introduce a more streamlined procedure in chapter 3 that 1) does not assume that the effect is zero to begin with, and 2) does not necessarily invoke a hypothetical infinite number of replicated realizations of the data, conditional on one particular parameter value. An alternative approach would be to use information theoretics to decide whether the interaction is warranted:

m2 <- lm(len ~ dose + supp, data = ToothGrowth)
AIC(m, m2)
##    df      AIC
## m   7 332.7056
## m2  5 337.2013

In the past decade following Burnham and Anderson’s book on the topic, ecologists have leaned heavily on Akaike’s information criterion (AIC), which is a relative measure of model quality (balancing goodness of fit with model complexity). Here we see that the original model m with interaction has a lower AIC value, and is therefore better supported. AIC can be considered to be similar to cross validation, approximating the ability of a model to predict future data.

Being somewhat lazy, we might again choose to plot the results of this model using the effects package.

plot(allEffects(m))

This is less than satisfying, as it does not show any data. All we see is model output. If the model is crap, then the output and these plots are also crap. But, evaluating the crappiness of the model is difficult when there are no data shown. Ideally, the data can be shown along with the estimated group means and some indication of uncertainty. If we weren’t quite so lazy, we could use the predict function to obtain confidence intervals for the means of each group.

# construct a new data frame for predictions
g <- expand.grid(supp = levels(ToothGrowth$supp), 
                 dose = levels(ToothGrowth$dose))
p <- predict(m, g, interval = 'confidence', type='response')
predictions <- cbind(g, data.frame(p))

ggplot(ToothGrowth, aes(x=interaction(dose, supp), y=len)) + 
  geom_segment(data=predictions, 
               aes(y=lwr, yend=upr, 
                   xend=interaction(dose, supp)), col='red') + 
  geom_point(data=predictions, aes(y=fit), color='red', size=2, shape=2) + 
  geom_jitter(position = position_jitter(width=.1), shape=1) + 
  ylab("Length")

This plot is nice because we can observe the data along with the model output. This makes it easier for readers to understand how the model relates to, fits, and does not fit the data. If you wish to obscure the data, you could make a bar plot with error pars to represent the standard errors. Although “dynamite” plots are common, we shall not include one here and we strongly recommend that you never produce such a plot (more here).

Interactions between continuous and categorical covariates

Sometimes, we’re interested in interactions between continuous or numeric covariates and another covariates with discrete categorical levels. Again, this falls under the broad class of models used in analysis of covariance (ANCOVA).

x1 <- rnorm(n)
x2 <- factor(sample(c('A', 'B'), n, replace=TRUE))

# generate slopes and intercepts for the first and second groups
a <- rnorm(2)
b <- rnorm(2)
sigma <- .4

X <- matrix(c(ifelse(x2 == 'A', 1, 0), 
              ifelse(x2 == 'B', 1, 0), 
              ifelse(x2 == 'A', x1, 0), 
              ifelse(x2 == 'B', x1, 0)
            ), nrow=n)

mu_y <- X %*% c(a, b)
y <- rnorm(n, mu_y, sigma)
plot(x1, y, col=x2, pch=19)
legend('topright', col=1:2, legend=c('Group A', 'Group B'), pch=19)

Here the intercepts and slopes are allowed to vary for two groups. We can fit a model with an interaction between these covariates. The intercepts and slopes are estimated separately for the two groups.

m <- lm(y ~ x1 * x2)
summary(m)
## 
## Call:
## lm(formula = y ~ x1 * x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.98741 -0.20117  0.00791  0.20823  0.66992 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.12712    0.07493  -1.696   0.0966 .  
## x1          -0.53550    0.09122  -5.871 4.52e-07 ***
## x2B         -0.17747    0.11014  -1.611   0.1139    
## x1:x2B      -0.47428    0.13849  -3.425   0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3852 on 46 degrees of freedom
## Multiple R-squared:  0.7395, Adjusted R-squared:  0.7226 
## F-statistic: 43.54 on 3 and 46 DF,  p-value: 1.737e-13

Let’s plot the lines of best fit along with the data.

plot(x1, y, col=x2, pch=19)
legend('topright', col=1:2, legend=c('Group A', 'Group B'), pch=19)
abline(coef(m)[1], coef(m)[2])
abline(coef(m)[1] + coef(m)[3], coef(m)[2] + coef(m)[4], col='red')

The abline function, used above, adds lines to plots based on a y-intercept (first argument) and a slope (second argument). Do you understand why the particular coefficients that we used as inputs provide the desired intercepts and slopes for each group?

Further reading

Schielzeth, H. 2010. Simple means to improve the interpretability of regression coefficients. Methods in Ecology and Evolution 1:103–113.

Enqvist, L. 2005. The mistreatment of covariate interaction terms in linear model analyses of behavioural and evolutionary ecology studies. Animal Behaviour 70:967–971.

Gelman and Hill. 2009. Data analysis using regression and multilevel/hierarchical models. Chapter 3-4.

Chapter 2: Maximum likelihood estimation

Big picture

The likelihood is defined as the probability of the data, conditional on some parameter(s). Having observed some data, we often want to know which particular parameter values maximize the probability of those data. These parameter values are referred to as the maximum likelihood estimates.

The goal here is to connect the notion of a likelihood to probability distributions and models. We can obtain maximum likelihood estimates (MLEs) in a few ways: analytically, with brute force (direct search), and via optimization (e.g., the optim function).

Learning goals

  • definition of likelihood
  • single parameter models: obtaining a MLE with optim
  • model of the mean with unknown variance
  • fitting simple linear models with likelihood
  • assumptions (especially related to independence)
  • inference (what probability does the likelihood function represent?)

What is likelihood?

The likelihood function represents the probability of the data \(y\), conditioned on the parameter(s) \(\theta\). Mathematically, the likelihood is \(p(\pmb{y}|\theta) = \mathcal{L}(\theta | \pmb{y})\), where \(\pmb{y}\) is a (possibly) vector-valued sample of observations from the random variable \(\pmb{Y} = (Y_1, Y_2, ..., Y_n)\). More casually, the likelihood function tells us the probability of having observed the sample that we did under different values of the parameter(s) \(\theta\). It is important to recognize that \(\theta\) is not treated as a random variable in the likelihood function (the data are treated as random variables). The likelihood is not the probability of \(\theta\) conditional on the data \(\pmb{y}\); \(p(y | \theta) \neq p(\theta | y)\). To calculate \(p(\theta | y)\), we’ll need to invert the above logic, and we can do so later with Bayes’ theorem (also known as the law of inverse probability).

Joint probabilities of independent events

You may recall that if we have two events \(A\) and \(B\), and we want to know the joint probability that both events \(A\) and \(B\) occur, we can generally obtain the joint probability as: \(P(A, B) = P(A|B)P(B)\) or \(P(A, B) = P(B|A)P(A)\). However, if the events \(A\) and \(B\) are independent, then \(P(A|B) = P(A)\) and \(P(B|A) = P(B)\). In other words, having observed that one event has happened, the probability of the other event is unchanged. In this case, we obtain the joint probability of two independent events as \(P(A, B)=P(A)P(B)\). This logic extends to more than two independent events: \(P(E_1, E_2, ..., E_n) = \prod_{i=1}^{n} E_i\), where \(E_i\) is the \(i^{th}\) event.

Why does this matter? Recall the independence assumption that we made in the construction of our linear models in the previous chapters: the error terms \(\epsilon_i \sim N(0, \sigma^2)\), or equivalently the conditional distribution of y values \(y_i\), \([y_i | \beta, \sigma^2] \sim N(X \beta, \sigma^2)\) are independent. Here the square brackets are used as a more compact version of probability notation, we could have also written \(P(Y_i = y_i | \beta, \sigma^2)\), the probability that the random variable \(Y_i\) equals a particular value \(y_i\) conditional on the parameters. The residual error term of observation \(i=1\) tells us nothing about the error term for \(i=2\), and conditional on a particular \(\beta\) and \(\sigma^2\), \(y_{1}\) tells us nothing about \(y_2\). If we assume that our observations are conditionally independent (conditioning on our parameter vector \(\theta = (\beta, \sigma^2)\)), then we can simply multiply all of the point-wise likelihoods together to find the joint probability of our sample \(\pmb{y}\) conditional on the parameters (the likelihood of our sample):

\[p(y_1, y_2, ..., y_n |\theta) = p(y_1 | \theta) p(y_2 | \theta) ... p(y_n | \theta)\] \[p(\pmb{y} | \theta) = \prod_{i=1}^{n} p(y_i | \theta)\] \[\mathcal{L}(\theta | \pmb{y}) = \prod_{i=1}^{n} p(y_i | \theta)\]

If the observations \(y_1, ..., y_n\) are not conditionally independent (or if you like, if the error terms are not independent), then a likelihood function that multiplies the point-wise probabilities together as if they are independent events is no longer valid. This is the problem underlying many discussions of non-independence, psuedoreplication, and autocorrelation (spatial, temporal, phylogenetic): all of these lead to violations of this independence assumption, meaning that it is not correct to work with the product of all the point-wise likelihoods unless terms are added to the model (e.g., blocking factors, autoregressive terms, spatial random effects) so that the observations are conditionally indepenent.

Obtaining maximum likelihood estimates

We have already obtained quite a few maximum likelihood estimates (MLEs) in the previous chapter with the lm() function. Here, we provide a more general treatment of estimation.

Assuming that we have a valid likelihood function \(\mathcal{L}(\theta | \pmb{y})\), we often seek to find the parameter values that maximize the probability of having observed our sample \(\pmb{y}\). We can proceed in a few different ways, analytically, by direct search, and by optimization for example. Usually the likelihood function is computationally and analytically more tractable on a log scale, so that we often end up working with the log-likelihood rather than the likelihood directly. This is fine, because any parameter(s) \(\theta\) that maximize the likelihood will also maximize the log-likelihood and vice versa, because the log function is strictly increasing. Mathematically, we might refer to a maximum likelihood estimate as the value of \(\theta\) that maximizes \(p(\pmb{y} | \theta)\). Recalling some calculus, it is reasonable to think that we might attempt to differentiate \(p(\pmb{y} | \theta)\) with respect to \(\theta\), and find the points at which the derivative equal zero to identify candidate maxima. The first derivative will be zero at a maximum, but also at any minima or inflection points, so in practice first-order differentiation alone is not sufficient to identify MLEs. In this class, we won’t worry about analytically deriving MLEs, but for those who are interested and have some multivariate calculus chops, see Casella and Berger’s 2002 book Statistical Inference.

So, we’ve established that the likelihood is: \(p(y | \theta) = \prod_{i=1}^n p(y_i | \theta)\). Computationally, this is challenging because we are working with really small numbers (products of small numbers) - so small that our computers have a hard time keeping track of them with much precision. Summing logs of small numbers is more computationally stable than multiplying many small numbers together. So, let’s instead work with the log likelihood by taking the log of both sides of the previous equation.

\[log(p(y|\theta)) = log \big(\prod_{i=1}^n p(y_i | \theta) \big)\]

Because \(log(ab) = log(a) + log(b)\), we can sum up the log likelihoods on the right side of the equation:

\[log(p(y|\theta)) = \sum_{i=1}^n log(p(y_i | \theta))\]

Optimization

Finding maxima and minima of functions is a common operation, and there are many algorithms that have been developed to accomplish these tasks. Some of these algorithms are included in the base R function optim(). We need to provide some initial values for the parameters and a function to minimize (if we find the minimum of the negative log-likelihood, then we have found the MLE).

# writing a negative log-likelihood function
nll <- function(theta, y){
  mu <- theta[1]
  sigma <- exp(theta[2]) 
  # theta[2] is the log of sigma, which is unconstrained
  # this simplifies optimization because we have no boundaries
  -sum(dnorm(y, mu, sigma, log=TRUE))
}

# initial guesses
theta_init <- c(mu = 4, log_sigma = 1)

# optimize
res <- optim(theta_init, nll, y=y)
res
## $par
##        mu log_sigma 
##  6.934133  1.155004 
## 
## $value
## [1] 128.7018
## 
## $counts
## function gradient 
##       65       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

If the algorithm has converged (check to see if res$convergence is zero), and if there is only one minimum in the negative log-likelihood surface (we know this is true based on analytical results in this case), then we have identified the MLEs of \(\mu\) and \(ln(\sigma)\). How do these estimates compare to our first estimates found via direct search?

MLE_optim <- c(res$par[1], exp(res$par[2]))
rbind(unlist(MLE_dsearch[c('mu', 'sigma')]), 
      unlist(MLE_optim))
##            mu    sigma
## [1,] 6.949495 3.171717
## [2,] 6.934133 3.174036

This approach is quite general, and can be modified to be used for instance in a linear regression context:

n <- 20
x <- runif(n)
y <- rnorm(n, 3 + 10 * x, sd = 1)

nll <- function(theta, y, x){
  alpha <- theta[1]
  beta <- theta[2]
  sigma <- exp(theta[3]) 
  mu <- alpha + beta * x
  -sum(dnorm(y, mu, sigma, log=TRUE))
}

# initial guesses
theta_init <- c(alpha = 4, beta = 1, log_sigma = 1)

# optimize
res <- optim(theta_init, nll, y = y, x = x)
res
## $par
##      alpha       beta  log_sigma 
##  3.8302509  9.8125532 -0.1694532 
## 
## $value
## [1] 24.99241
## 
## $counts
## function gradient 
##      148       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
plot(x, y)
abline(a = res$par['alpha'], b = res$par['beta'])

Direct search and optimization approaches for maximum likelihood estimation can be useful, but in this class we will rarely make use of direct search and optimization. However, the likelihood function and maximum likelihood estimation will continue to play a central role.

Further reading

Casella and Berger. 2002. Statistical Inference, Chapter 7.

Scholz FW. 2004. Maximum likelihood estimation, in Encyclopedia of Statistical Sciences.

Gelman and Hill. 2009. Data analysis using regression and multilevel/hierarchical models. Chapter 18.

Chapter 3: Bayesian inference

Big picture

In recent decades Bayesian inference has increasingly been used in ecology. A key difference between Bayesian and maximum likelihood approaches lies in which quantities are treated as random variables. For any likelihood function, the parameters \(\theta\) are not random variables because there are no probability distributions associated with \(\theta\). In contrast, Bayesian approaches treat parameters and future data (actually all unobserved quantities) as random variables, meaning that each unknown is associated with a probability distribution \(p(\theta)\). In both Bayesian and maximum likelihood approaches, the observations \(y\) are treated as a random variable, and the probability distribution for this random variable \(p(y \mid \theta)\) plays a central role in both approaches. See Hobbs and Hooten (Ch. 5) for a more detailed treatment on differences between Bayesian and maximum likelihood based inferential approaches. Some authors also point out differences between Bayesian and frequentist definitions of probability. In a frequentist framework, probabilities are defined in term of long run frequencies of events, often relying on hypothetical realizations. Bayesian probability definitions do not rely on the long-run frequency of events.

Philosophy aside, Bayesian approaches have become more popular because of intuitive appeal and practical advantages. Intuitively, it can seem backwards to focus on the probability of the data given the parameters \(p(y \mid \theta)\). What we really want to know is the probability of the parameters, having observed some data \(p(\theta \mid y)\). As we will see, Bayes’ theorem allows us to calculate this probability. Second, Bayesian approaches are often easier to implement than maximum likelihood or frequentist approaches, particularly for complex models. Finally, we find that in many applications, Bayesian approaches facilitate a better understanding of model structure and assumptions.

Learning goals

  • Bayes’ theorem and Bayesian probability
  • relationsihp between likelihood and Bayesian inference
  • priors (generally, informative vs. non-informative)
  • proper vs. improper priors
  • intro to Bayesian computation and MCMC
  • posterior summaries and comparisons
  • single parameter models: MLE vs. Bayesian treatments
  • Bayesian linear regression: intro to Stan

Bayes’ theorem

Bayes’ theorem is an incredibly powerful theorem that follows from the rules of probability. To prove the theorem, we need only a few ingredients: 1) the definition of joint probabilities \(p(A, B) = p(A \mid B) p(B)\) or \(p(A, B) = p(B \mid A)p(A)\) (both are valid) and 2) a bit of algebra.

\[p(A, B) = p(A \mid B) p(B)\]

\[p(B \mid A)p(A) = p(A \mid B) p(B)\]

\[p(B \mid A) = \dfrac{p(A \mid B) p(B)}{p(A)}\]

This is Bayes’ theorem. In modern applications, we typically substitute unknown parameters \(\theta\) for \(B\), and data \(y\) for \(A\):

\[p(\theta \mid y) = \dfrac{p(y \mid \theta) p(\theta)}{p(y)}\]

The terms can verbally be described as follows:

  • \(p(\theta \mid y)\): the posterior distribution of the parameters. This tells us what the parameters probably are (and are not), conditioned on having observed some data \(y\).
  • \(p(y \mid \theta)\): the likelihood of the data \(y\).
  • \(p(\theta)\): the prior distribution of the parameters. This should represent our prior knowledge about the values of the parameters. Prior knowledge comes from similar studies and/or first principles.
  • \(p(y)\): the marginal distribution of the data. This quantity can be difficult or even impossible to compute, will always be a constant after the data have been observed, and is often ignored.

Because \(p(y)\) is a constant, it is valid and common to consider the posterior distribution up to this proportionality constant:

\[p(\theta \mid y) \propto p(y \mid \theta) p(\theta)\]

Prior distributions

We have already learned about likelihood, but the introduction of a prior distribution for the parameters requires some attention. The inclusion of prior information is not unique to Bayesian inference. When selecting study systems, designing experiements, cleaning or subsetting data, and choosing a likelihood function, we inevitably draw upon our previous knowledge of a system.

From a Bayesian perspective, prior distributions \(p(\theta)\) represent our knowledge/beliefs about parameters before having observed the data \(y\), and the posterior distribution represents our updated knowledge/beliefs about parameters after having observed our data. This is similar to the way many scientists operate: we think we know something about a system, and then we do experiments and conduct surveys to update our knowledge about the system. But, the observations generated from the experiments and surveys are not considered in isolation. They are considered in the context of our previous knowledge. In this way, the posterior distribution represents a compromise between our prior beliefs and the likelihood.

Analytical posterior with conjugate priors: Bernoulli case*

* this section is a bit math-heavy for illustration, but most of the time we won’t find the posterior analytically

The Bernoulli distribution is a probability distribution for binary random variables (e.g., those that take one of two values: dead or alive, male or female, heads or tails, 0 or 1, success or failure, and so on). The Bernoulli distribution has one parameter, \(p:\) the probability of “success” (or more generally, the probability of one of the two outcomes) in one particular event. A Bernoulli random variable takes one of these two values. The choice of which of the two possible outcomes is considered “success” is often arbitrary - we could consider either “heads” or “tails” to be a success if we wanted to. If \(p\) is the probability of success in one particular trial, then the probability of failure is just the complement of \(p:\) \(1 - p\), sometimes referred to as \(q\), such that \(p + q = 1\). For those familiar with the Binomial distribution, the Bernoulli distribution is a special case of the Binomial, with one trial \(k=1\). We can use the Bernoulli distribution as a likelihood function, where \(y\) is either zero or one: \(y \in \{0, 1\}\), and \(p\) is our only parameter. Because p is a probability, we know that \(0 \leq p \leq 1\).

We can express the likelihood for a Bernoulli random variable as follows.

\[p(y \mid p) = \begin{cases} p &\mbox{if } y = 1 \\ 1-p & \mbox{if } y = 0 \end{cases}\]

Equivalently and more generally:

\[[y \mid p] = p^y (1-p)^{1-y}\]

If we have \(n\) independent Bernoulli random variables, \(y_1, ..., y_n\), each with probability \(p\), then the joint likelihood can be written as the product of the point-wise likelihoods:

\[[y_1, ..., y_n \mid p] = p^{y_1} (1-p)^{1 - y_1} ... p^{y_n} (1 - p)^{1 - y_n}\]

\[[\pmb{y} \mid p] = \prod_{i = 1}^{n} p^{y_i} (1 - p)^{1 - y_1}\]

Recalling from algebra that \(x^a x^b = x^{a + b}\), this implies:

\[[\pmb{y} \mid p] = p^{\sum_i y_i} (1 - p)^{n - \sum_i y_i}\]

Having obtained the likelihood, we now must specify a prior to complete our specification of the joint distribution of data and parameters \([y \mid p][p]\), the numerator in Bayes’ theorem. A natural choice is the Beta distribution, which has two parameters \(\alpha\) and \(\beta\), with support on the interval \((0, 1)\). This is a good choice because its bounds are similar to those for probabilities, and the posterior induced by the prior is also a Beta distribution. When a prior distribution for a parameter induces a posterior distribution that is of the same form (same probability distribution) as the prior, the prior is said to be a “conjugate prior” for the likelihood. The density of the beta distribution for parameter \(p\) has two parameters \(\alpha\) and \(\beta\) and is defined as:

\[[p] = c p^{\alpha - 1} (1 - p)^{\beta - 1}\]

Where \(c\) is a constant that ensures that \([p \mid \alpha, \beta]\) integrates to one over the interval \((0, 1)\) (i.e., it is a true probability distribution), with \(c=\dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\), and \(\Gamma(x) = (x - 1)!\) That’s a factorial symbol (\(!\)), not a puncutation mark! To give a bit of intuition for what the beta distribution looks like, here are some plots of the beta density with different values of \(\alpha\) and \(\beta\):

alpha <- rep(c(1, 5, 10))
beta <- rep(c(1, 5, 10))
g <- expand.grid(alpha=alpha, beta=beta)
x <- seq(0, 1, .005)

par(mfrow=c(3, 3))
for (i in 1:nrow(g)){
  plot(x, dbeta(x, g$alpha[i], g$beta[i]), 
       type='l', ylim=c(0, 10), 
       ylab="[x]", lwd=3)
  title(bquote(alpha == .(g$alpha[i]) ~ ", " ~ beta == .(g$beta[i])))
}

One commonly used prior is the beta(1, 1), because it corresponds to a uniform prior for \(p\) (shown in the top left corner). Now we have all of the ingredients to embark on our first Bayesian analysis for a Bernoulli random variable. We’ll proceed by finding the posterior distribution up to some proportionality constant, then we’ll use our knowledge of the beta distribution to recover the correct proportionality constant:

\[[p | y] = \dfrac{[y \mid p] [p]}{[y]}\]

\[[p | y] \propto [y \mid p] [p]\]

Plugging in the likelihood and prior that we described above:

\[[p | y] \propto p^{\sum_i y_i} (1 - p)^{n - \sum_i y_i} [p]\]

\[[p | y] \propto p^{\sum_i y_i} (1 - p)^{n - \sum_i y_i} c p^{\alpha - 1} (1 - p)^{\beta - 1}\]

Dropping \(c\), because we’re only working up to some proportionality constant:

\[[p | y] \propto p^{\sum_i y_i} (1 - p)^{n - \sum_i y_i} p^{\alpha - 1} (1 - p)^{\beta - 1}\]

Then, again recalling that \(x^a x^b = x^{a + b}\), we find that

\[[p | y] \propto p^{\alpha -1 + \sum_i y_i} (1 - p)^{\beta - 1 +n - \sum_i y_i}\]

Notice that this is of the same form as the beta prior for \(p\), with updated parameters: \(\alpha_{post} = \alpha + \sum_i y_i\) and \(\beta_{post} = \beta + n - \sum_i y_i\). In this sense, the parameters of the beta prior \(\alpha\) and \(\beta\) can be interpreted as the previous number of successes and failures, respectively. Future studies can simply use the updated values \(\alpha_{post}\) and \(\beta_{post}\) as priors. We have found a quantity that is proportional to the posterior distribution, which often is enough, but here we can easily derive the proportionality constant that will ensure that the posterior integrates to one (i.e., it is a true probablity distribution).

Recall the proportionality constant \(c\) from the beta distribution prior that we used, which is \(c=\dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\). Updating this proportionality constant gives us the correct value for our posterior distribution, ensuring that it integrates to one, so that we now can write down the posterior distribution in closed form:

\[[p | y] = \frac{\Gamma(\alpha_{post} + \beta_{post})}{\Gamma(\alpha_{post})\Gamma(\beta_{post})} p^{\alpha_{post} -1} (1 - p)^{\beta_{post} - 1}\]

Now we can explore the effect of our prior distributions on the posterior. Suppose that we have observed \(n=8\) data points, \(y_1, y_2, ..., y_{10}\), with \(y_i \in \{0, 1\}\) and 2 successes, \(\sum_i y_i = 2\). Let’s graph the posterior distributions that are implied by the priors plotted above and the likelihood resulting from these observations.

g$alpha_post <- g$alpha + 2
g$beta_post <- g$beta + 6
  
par(mfrow=c(3, 3))
for (i in 1:nrow(g)){
  plot(x, dbeta(x, g$alpha[i], g$beta[i]), 
       type='l', ylim=c(0, 10), 
       xlab="p", ylab="[p | y]", lwd=3, col='grey')
  lines(x, dbeta(x, g$alpha_post[i], g$beta_post[i]), 
        lwd=3, col='blue')
  lines(x, 8 * dbinom(x=2, size=8, prob=x), col='red')
  title(bquote(alpha == .(g$alpha[i]) ~ ", " ~ beta == .(g$beta[i])))
}

Here the prior is shown in grey, the likelihood is shown in red, and the posterior distribution is shown in blue. Notice how strong priors pull the posterior distribution toward them, and weak priors result in posteriors that are mostly affected by the data. The majority of Bayesian statisticians nowadays advocate for the inclusion of reasonable prior distributions rather than prior distributions that feign ignorance. One nice feature of Bayesian approaches is that, given enough data, the prior tends to have less of an influence on the posterior. This is consistent with the notion that when reasonable people are presented with strong evidence, they tend to more or less agree even if they may have disagreed ahead of time (though at times the empirical evidence for this phenomenon may seem somewhat equivocal). To show this, let’s increase the amount of information in our data, so that we have \(n=800\) and \(\sum y = 200\) successes.

g$alpha_post <- g$alpha + 200
g$beta_post <- g$beta + 600
  
par(mfrow=c(3, 3))
for (i in 1:nrow(g)){
  plot(x, dbeta(x, g$alpha[i], g$beta[i]), 
       type='l', ylim=c(0, 32), 
       xlab="p", ylab="[p | y]", col='grey', lwd=2)
  lines(x, dbeta(x, g$alpha_post[i], g$beta_post[i]), 
        col='blue', lwd=2)
    lines(x, 800 * dbinom(x=200, size=800, prob=x), col='red')
  title(bquote(alpha == .(g$alpha[i]) ~ ", " ~ beta == .(g$beta[i])))
}

Here again the prior is shown in grey, the likelihood is shown in red, and the posterior distribution is shown in blue. The posterior distribution has essentially the same shape as the likelihood because we have much more information in this larger data set relative to our priors.

Improper priors

Improper priors do not integrate to one (they do not define proper probability distributions). For instance, a normal distribution with infinite variance will not integrate to one. Sometimes improper priors can still lead to proper posteriors, but this must be checked analytically. Unless you’re willing to prove that an improper prior leads to a proper posterior distribution, we recommend using proper priors.

Posterior computation the easy way

In reality, most of the time we don’t analytically derive posterior distributions. Mostly, we can express our models in specialized programming languages that are designed to make Bayesian inference easier. Here is a Stan model statement from the above example.

data {
  int n;
  int<lower=0, upper=1> y[n];
}

parameters {
  real<lower=0, upper=1> p;
}

model {
  p ~ beta(1, 1);
  y ~ bernoulli(p);
}

The above model statement has three “blocks”. The data block specifies that we have two fixed inputs: the sample size \(n\) and a vector of integer values with \(n\) elements, and these have to be either zero or one. The parameter block specifies that we have one parameter \(p\), which is a real number between 0 and 1. Last, the model block contains our beta prior for \(p\) and our Bernoulli likelihood for the data \(y\).

Stan does not find the analytical expression for the posterior. Rather, Stan translates the above program into a Markov chain Monte Carlo (MCMC) algorithm to simulate samples from the posterior distribution. In practice, this is sufficient to learn about the model parameters.

What is MCMC?

MCMC is used to sample from probability distributions generally, and from posterior probability distributions in the context of Bayesian inference. This is a huge topic, and involves a fair bit of theory that we will not dwell on here, but it is worth reading Chapter 18 in Gelman and Hill for a taste of some of the algorithms that are used. In this class, we will rely on specialized software to conduct MCMC simulations, so that many of the details are left “under the hood”. However, it is still important to know what MCMC is (supposed to be) doing, and how to identify when MCMC algorithms fail.

In a Bayesian context, we often run multiple Markov chain simulations, where we iteratively update our parameter vector \(\theta\). If all goes well, then eventually each Markov chain converges to the posterior distribution of the parameters, so that every draw can be considered a simulated sample from the posterior distribution. Typically, we initialize the chains at different (often random) points in parameter space. If all goes well, then after some number of iterations, every chain has converged to the posterior distribution, and we run the chains a bit longer to generate a representatitve sample from the posterior, then perform inference on our sample. Here is what this looks like in a 2-d (two parameter) model of the mean situation.

Notice that the chains start dispersed in parameter space and eventually all converge to the same region and stay there. After some large number of iterations, the estimated posterior density stabilizes. At this point, usually the early draws from the Markov chains are discarded as “warmup” or “burnin”, as these do not represent draws from the posterior, because the chains had not coverged.

It is always necessary to run diagnostics on MCMC output to ensure convergence. Some of these are graphical. Traceplots show the parameter values that a Markov chain has taken on the y-axis, and iteration number on the x-axis. For instance, if we run our Stan model from before:

library(rstan)
m <- '
data {
  int n;
  int<lower=0, upper=1> y[n];
}

parameters {
  real<lower=0, upper=1> p;
}

model {
  p ~ beta(1, 1);
  y ~ bernoulli(p);
}
'

y <- c(1, 1, 0, 0, 0, 0, 0, 0)
n <- length(y)

out <- stan(model_code=m)
## 
## SAMPLING FOR MODEL '8dbabac1e0f3c5dfe4a2c8e37a6f8ed1' NOW (CHAIN 1).
## 
## Chain 1, Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1, Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1, Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1, Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1, Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1, Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1, Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1, Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1, Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1, Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1, Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1, Iteration: 2000 / 2000 [100%]  (Sampling)
## #  Elapsed Time: 0.00543 seconds (Warm-up)
## #                0.005428 seconds (Sampling)
## #                0.010858 seconds (Total)
## 
## 
## SAMPLING FOR MODEL '8dbabac1e0f3c5dfe4a2c8e37a6f8ed1' NOW (CHAIN 2).
## 
## Chain 2, Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2, Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2, Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 2, Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 2, Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 2, Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 2, Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 2, Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 2, Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 2, Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 2, Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 2, Iteration: 2000 / 2000 [100%]  (Sampling)
## #  Elapsed Time: 0.005528 seconds (Warm-up)
## #                0.004968 seconds (Sampling)
## #                0.010496 seconds (Total)
## 
## 
## SAMPLING FOR MODEL '8dbabac1e0f3c5dfe4a2c8e37a6f8ed1' NOW (CHAIN 3).
## 
## Chain 3, Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3, Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3, Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 3, Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 3, Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 3, Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 3, Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 3, Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 3, Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 3, Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 3, Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 3, Iteration: 2000 / 2000 [100%]  (Sampling)
## #  Elapsed Time: 0.005624 seconds (Warm-up)
## #                0.004917 seconds (Sampling)
## #                0.010541 seconds (Total)
## 
## 
## SAMPLING FOR MODEL '8dbabac1e0f3c5dfe4a2c8e37a6f8ed1' NOW (CHAIN 4).
## 
## Chain 4, Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 4, Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 4, Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 4, Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 4, Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 4, Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 4, Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 4, Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 4, Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 4, Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 4, Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 4, Iteration: 2000 / 2000 [100%]  (Sampling)
## #  Elapsed Time: 0.005566 seconds (Warm-up)
## #                0.004941 seconds (Sampling)
## #                0.010507 seconds (Total)
traceplot(out, inc_warmup=TRUE)

This traceplot is useful for verifying convergence: all of the chains appear to be sampling from the same region. We can also inspect some numerical summaries that are used to detect non-convergence. Specifically, we can look at the \(\hat{R}\) statistic. If \(\hat{R} > 1.1\), then we ought to be worried about convergence of our chains. Printing our model output returns this statistic as well as some other summary statistics for the posterior draws.

out
## Inference for Stan model: 8dbabac1e0f3c5dfe4a2c8e37a6f8ed1.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##       mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## p     0.29    0.00 0.14  0.07  0.18  0.27  0.38  0.60  1253    1
## lp__ -6.68    0.03 0.81 -9.16 -6.87 -6.37 -6.17 -6.11   944    1
## 
## Samples were drawn using NUTS(diag_e) at Sun Nov 22 12:51:12 2015.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

This output also tells us that we ran four MCMC chains, each for 2000 iterations, and the first 1000 iterations were discarded as warmup (the shaded region in the traceplot). For each parameter (and for the log probability up to a proportionality constant lp__), we get the posterior mean, the MCMC standard error of the mean, the posterior standard deviation, some quantiles, and an estimate for the number of effective samples from the posterior n_eff, which should typically be at least in the hundreds.

Example: normal linear models

Recall that all normal linear models can be expressed as:

\[y \sim N(X \beta, \sigma^2)\]

To complete a Bayesian analysis, we need to select prior distributions for the unknown parameters \(\beta\) and \(\sigma^2\). For instance:

\[\beta \sim N(0, 5)\]

\[\sigma \sim halfNormal(0, 5)\],

where the half-Normal with mean zero is a folded Gaussian probability density function with only positive mass:

Let’s do a quick linear regression model and summarize/plot the results.

n <- 20
x <- runif(n, 0, 3)
y <- rnorm(n, -3 + .75 * x, 1)
plot(x, y)

We’ll use Stan again, but this time instead of specifying an object (m above) that is a character string containing our model statement, we’ll save the model file somewhere else with a .stan file extension. For instance, maybe we have a general purpose Stan model that can be used for linear models called lm.stan:

data {
  int n; // sample size
  int p; // number of coefficients
  matrix[n, p] X;
  vector[n] y;
}

parameters {
  vector[p] beta;
  real<lower=0> sigma;
}

model {
  beta ~ normal(0, 5);
  sigma ~ normal(0, 5);
  y ~ normal(X * beta, sigma);
}

So we have this file saved somewhere as lm.stan, and it can fit any of the linear models that we covered in Chapter 1 by changing the design matrix, but now we can include information to improve our estimates. Because this is a simulated example, we’ll use somewhat vague priors. Fitting the model:

library(rstan)
X <- matrix(c(rep(1, n), x), ncol = 2)
stan_d <- list(n = nrow(X), p = ncol(X), X = X, y = y)
out <- stan('lm.stan', data = stan_d)
## 
## SAMPLING FOR MODEL 'lm' NOW (CHAIN 1).
## 
## Chain 1, Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1, Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1, Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1, Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1, Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1, Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1, Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1, Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1, Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1, Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1, Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1, Iteration: 2000 / 2000 [100%]  (Sampling)
## #  Elapsed Time: 0.038592 seconds (Warm-up)
## #                0.042849 seconds (Sampling)
## #                0.081441 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'lm' NOW (CHAIN 2).
## 
## Chain 2, Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2, Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2, Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 2, Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 2, Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 2, Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 2, Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 2, Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 2, Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 2, Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 2, Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 2, Iteration: 2000 / 2000 [100%]  (Sampling)
## #  Elapsed Time: 0.035292 seconds (Warm-up)
## #                0.036954 seconds (Sampling)
## #                0.072246 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'lm' NOW (CHAIN 3).
## 
## Chain 3, Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3, Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3, Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 3, Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 3, Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 3, Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 3, Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 3, Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 3, Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 3, Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 3, Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 3, Iteration: 2000 / 2000 [100%]  (Sampling)
## #  Elapsed Time: 0.035088 seconds (Warm-up)
## #                0.038331 seconds (Sampling)
## #                0.073419 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'lm' NOW (CHAIN 4).
## 
## Chain 4, Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 4, Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 4, Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 4, Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 4, Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 4, Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 4, Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 4, Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 4, Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 4, Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 4, Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 4, Iteration: 2000 / 2000 [100%]  (Sampling)
## #  Elapsed Time: 0.040755 seconds (Warm-up)
## #                0.040418 seconds (Sampling)
## #                0.081173 seconds (Total)
traceplot(out)

There are also some other default plots which are nice:

plot(out)
## Showing 80% intervals
pairs(out)

Notice that the slopes and intercepts are correlated in the posterior (do you recall why?). Also, lp__ is tracked automamatically, and this is proportional to the log probability of the posterior distribution.

Let’s inspect the output in table form:

out
## Inference for Stan model: lm.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##           mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
## beta[1]  -3.23    0.02 0.73  -4.66  -3.70  -3.23  -2.76  -1.80   853    1
## beta[2]   0.76    0.01 0.39  -0.03   0.51   0.76   1.00   1.53   662    1
## sigma     1.34    0.01 0.25   0.96   1.16   1.30   1.48   1.90  1107    1
## lp__    -15.08    0.05 1.34 -18.45 -15.64 -14.73 -14.13 -13.59   886    1
## 
## Samples were drawn using NUTS(diag_e) at Sun Nov 22 12:51:41 2015.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

And finally, let’s extract our samples from the posterior and plot our estimated line of best fit.

library(scales)
post <- extract(out)

# draw points
plot(x, y)

# add a line for each draw from the posterior
n_iter <- length(post$lp__)
for (i in 1:n_iter){
  abline(post$beta[i, 1], post$beta[i, 2], col=alpha('dodgerblue', .05))
}

# add points again so that they are visible over the line
points(x, y, pch=19)

We might be particularly interested in the effect of x on y. If we have abandoned frequentism, then how might we think about this? One way not to think of it is asking what the probability is that the slope is exactly equal to zero. If we think of a posterior probability density of our slope, then the true probability of any one particular value is zero because this is a probability density function, not a probability mass function (this is true for \(\beta=0\) and for \(\beta = .0112351351\), for instance).

plot(density(post$beta[, 2]))

A better question (answerable with a probability density function) might be: given the data, what is the probability that \(x\) has a positive effect on \(y\)? This is equivalent to asking about the area to the right of zero in the posterior distribution of \(\beta\), and the number is approximated simply the proportion of posterior draws greater than zero.

mean(post$beta[, 2] > 0)
## [1] 0.97075
ord <- order(post$beta[, 2])
plot(post$beta[ord, 2], 
     col=ifelse(post$beta[ord, 2] > 0, 'red', 'black'), 
     ylab='Sorted posterior draws for the slope')

We might also construct a 95% credible interval for our estimate to communicate uncertainty in our estimate of \(\beta\).

quantile(post$beta[, 2], probs=c(0.025, 0.975))
##        2.5%       97.5% 
## -0.02518672  1.52906633

Conditional on the data, there is a 95% probability that the true parameter value is in the credible interval. This is different from the interpretation of frequentist confidence intervals, which relies on long-run frequencies for imaginary realizations of the data collection and interval estimation procedure. Our confidence in confidence intervals is in the procedure of creating confidence intervals - in the hypothetical long run, 95% of the intervals that we construct will contain the true value. As pointed out here, this is the right answer to the wrong question.

The credible interval constructed above has equal probability density on either side of the interval because we based our interval end-points on quantiles. Sometimes, we may wish to instead construct an interval that is as narrow as possible while encapsulating some proportion of the probability mass. These intervals are called highest density posterior intervals, and can be constructed using the following function:

HDI <- function(values, percent=0.95){
  sorted <- sort(values)
  index <- floor(percent * length(sorted))
  nCI <- length(sorted) - index
  
  width <- rep(0, nCI)
  for (i in 1:nCI){
    width[i] <- sorted[i + index] - sorted[i]
  }
  
  HDImin <- sorted[which.min(width)]
  HDImax <- sorted[which.min(width) + index]
  HDIlim <- c(HDImin, HDImax)
  return(HDIlim)
}

HDI(post$beta[, 2])
## [1] -0.0537326  1.4818406

In this instance, our posterior is fairly symmetric and the two types of credible intervals will not be much different.

Chapter 4: Poisson models

Here is my preference for teaching Poisson models first. I (personally) prefer to start with poisson because (1) IMO log links are more intuitive than logit (probit etc) links (2) latent variables remain between zero and infinity (3) IMO overdispersion is easier to understand than in binomial models

Priorties

  • non-gaussian data (counts, proportions, binary, exponential)
  • link functions and latent variables
  • Poisson distribution
  • log link as a map (what are we actually estimating – mean or rate)
  • understanding effects sizes in Poisson models
  • dependence of mean and variance in non-Gaussian models
  • overdispersion : quasi-Poisson and negative-binomial
  • overdispersion as unaccounted for variation (maybe a simple example – sex effects)
  • implementation with glm, Stan, JAGS
  • graphical displays
  • model checking

Optional

  • simulation of data & parameter recovery

Chapter 5: Binomial models

Here, the students will continue to use a combination of methods for implementation. Key points to take away from this section include the properties/behavior of bionomial models, ways to check binomial modles, and a hint that Bayesian approaches are going to be more flexible. The binomial-Poisson hierarchical model is a classic that should reinforce the notion that Bayesian approaches will generally be easier for more complex examples.

Priorities

  • binomial distribution (relationship between mean and variance)
  • logit link as a map
  • proportion vs. binary models (will help with understanding hierarchical models later)
  • implementation with glm
  • overdispersion in proportion models and understanding the difference between individual and group level probabilities
  • implementation with Stan, JAGS
  • separation
  • hierarchical model number 1: occupancy model (a classic) (maybe, or we could do it later)
  • review marginalization
  • graphical displays
  • model checking

Optional

  • simulation of data & parameter recovery

Chapter 6: Partial pooling and likelihood

The main dish. I’d like to avoid a recipe-based approach where we discuss varying intercept and varying slope models as primary objectives. Instead, I think it’s important to cover these topics as special cases of the general approach of hierarchical modeling as a means to impose probabilistic structures on parameters. From that perspective, students should be able to better extend these methods for their own work.

Priorities

  • definition
  • review previous examples
  • hyperparameters (they’ve always been there even when we don’t acknowledge them)
  • varying intercepts (NBA freethrow example) with lme4
  • partial pooling
  • clearing up confusion about nestedness
  • simple hierarchical models with likelihood
  • continous predictors for multiple levels

Optional

  • plotting estimates for different levels from lme4 models

Chapter 7: Bayesian hierarchical models

Priorities

  • varying intercepts (NBA freethrow example) with Stan
  • hierarchical models in Stan
  • highlight Bayesian connection to priors
  • classic examples:
    • hierarchical model number 1: occupancy model (a classic)
    • hierarchical model: binomial-Poisson hierarchy (e.g. # eggs laid & survival)
  • introduction to the multivariate normal distribution
  • parameters for hierarchical variance parameters
  • prediction (new vs. observed groups)
  • priors
  • note crossing of the ‘ease’ threshold (?)

Optional

  • posterior prediction
  • basic Bayesian models in MCMCglmm
  • random effects, fixed effects, mixed effects models as special instances of hierarchical linear models

Reading

Gelman, A., J. Hill, and M. Yajima. 2012. Why We (Usually) Don’t Have to Worry About Multiple Comparisons. Journal of Research on Educational Effectiveness 5:189–211.
Gelman and Hill discussion of random effects terminology (very good)

Chapter 8: Hierarchical model construction

This is where I think we will have the greatest impact on students future work. Translating problems to models is a key skill, and it may take a fair bit of practice. Tools to implement include graphical skills (e.g. drawing DAGs), and familiarity with probability distributions.

Chapter 9: Model comparison

I envisions this as occuring a bit more ad hoc during the second half as students start to build their own models

Reading

Hooten, M. B., and N. T. Hobbs. 2015. A guide to Bayesian model selection for ecologists. Ecological Monographs 85:3–28.